-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184
[SPARK-19843] [SQL] UTF8String => (int / long) conversion expensive for invalid inputs #17184
Conversation
Jenkins test this please |
Test build #74061 has finished for PR 17184 at commit
|
cc @cloud-fan for review |
also, cc @gatorsmile |
…or invalid inputs
Ideally we need a better way to indicate an invalid input, as exception handling is too expensive. Adding a check before process is one approach, but as you said, it has performance penalty if all records are valid. How about we use C-style function and return the result via parameter?. e.g.
The caller side can reuse the result object to reduce cost |
|
||
@Test | ||
public void testIsIntMaybe() throws IOException { | ||
assertEquals(true, UTF8String.fromString("1").isIntMaybe()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertTrue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
assertEquals(true, UTF8String.fromString(String.valueOf(Integer.MIN_VALUE)).isIntMaybe()); | ||
|
||
Random rand = new Random(); | ||
for(int i = 0; i < 10; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, which lint might catch: space after for.
Seed the RNG for determinism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- fixed linter
- I dont want determinism so not setting seed
|
||
if (!Character.isDigit(firstByte)) { | ||
// if the first character isn't a digit, then it has to be either `+` OR `-` | ||
if ((firstByte == '-' || firstByte == '+')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: extra parens here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is removed as I have changed the approach
import org.apache.spark.unsafe.types.UTF8String | ||
import org.apache.spark.util.Benchmark | ||
|
||
object UTF8StringBenchmark { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what the convention is, but I don't know if we need to preserve this benchmarking code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the benchmark code. Had originally attached in case anyone wanted to verify if the benchmarking was sane
adcb405
to
a91ca1b
Compare
if (b >= '0' && b <= '9') { | ||
return b - '0'; | ||
public static class LongWrapper { | ||
private long value = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's for internal use only, shall we make it public? We can probably save some function calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed getter and setter. changed to public
case _: NumberFormatException => null | ||
val result = new IntWrapper() | ||
buildCast[UTF8String](_, s => if (s.toShort(result)) { | ||
result.getValue.asInstanceOf[Short] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use toShort
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
Generated code (in case anyone reviewing the PR wants to see):
|
Updated the perf. numbers with latest version of code |
Test build #74156 has finished for PR 17184 at commit
|
Test build #74159 has finished for PR 17184 at commit
|
Test build #74161 has finished for PR 17184 at commit
|
thanks, merging to master! |
return b - '0'; | ||
} | ||
throw new NumberFormatException(toString()); | ||
public static class LongWrapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tejasapatil can you submit a follow up small pr to add classdoc for this? would be great to explain why we have this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
followup PR : #17205
return true; | ||
} | ||
|
||
public static class IntWrapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and perhaps move this closer to LongWrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually why bother having an IntWrapper? Can't you use LongWrapper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea the calculation still use int/long respectively, the wrapper is only used for holding the result. There should be not much performance penalty to always use LongWrapper
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using IntWrapper for integers gives better perf. If we use LongWrapper for integers, there would be a conversion needed from long -> int. Its not that big of a difference but given that ints are used heavily in workloads, I dont want to leave that behind.
Here is microbenchmark result for current approach VS using LongWrapper everywhere
conversion to int: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
IntWrapper 20397 / 20564 26.3 38.0 1.0X
LongWrapper 20855 / 21530 25.7 38.8 1.0X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK that's fair. In that case let's document this (otherwise somebody might come in and remove this in the future..)
I believe IBM J9 actually improved this specific case (their JIT handles tons of exceptions better). Oh well -- if only JIT is perfect. |
## What changes were proposed in this pull request? This is as per suggestion by rxin at : #17184 (comment) ## How was this patch tested? NA as this is a documentation change Author: Tejas Patil <tejasp@fb.com> Closes #17205 from tejasapatil/SPARK-19843_followup.
What changes were proposed in this pull request?
Jira : https://issues.apache.org/jira/browse/SPARK-19843
Created wrapper classes (
IntWrapper
,LongWrapper
) to wrap the result of parsing (which are primitive types). In case of problem in parsing, the method would return a boolean.How was this patch tested?
Performance
Tiny regression when all strings are valid integers
Huge gain when all strings are invalid integers